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We introduce NSeq, a fast and efficient Java application for finding positioned 
nucleosomes from the high-throughput sequencing of MNase-digested 
mononucleosomal DNA. NSeq includes a user-friendly graphical interface, computes 
false discovery rates (FDRs) for candidate nucleosomes from Monte Carlo simulations, 
plots nucleosome coverage and centers, and exploits the availability of multiple processor 
cores by parallelizing its computations. Java binaries and source code are freely available 
at https://github.com/songlab/NSeq. The software is supported on all major platforms 
equipped with Java Runtime Environment 6 or later. 
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1. INTRODUCTION 

Eukaryotic DNA is organized into chromatin, consisting of 
repeating nucleosomes adjoined by linker DNA. A nucleo- 
some itself is composed of ~ 146 bp of DNA wound around an 
octameric histone core. The core histones participate in the epi- 
genetic regulation of gene expression in two important ways: 
they can block access to regulatory sequences by DNA-binding 
factors; and covalent modifications of their N-terminal tails 
can affect the recruitment of protein complexes, which in turn 
influences transcriptional regulation. Although nucleosomes are 
highly dynamic, being subjected to thermal fluctuations and 
ATP-dependent remodeling (Blossey and Schiessel, 2011), some 
nucleosomes are well-localized across populations of a given cell 
type. Such positioned nucleosomes are likely to be more func- 
tional than delocalized nucleosomes and may be under selection 
and regulatory forces (Yuan et al, 2005; Ozsolak et al., 2007; Field 
et al, 2008; Song et al, 2008). 

This paper introduces NSeq, an open-source Java application 
that rigorously identifies positioned nucleosomes from the next- 
generation sequencing of micrococcal nuclease (MNase)-digested 
mononucleosomal DNA. MNase preferentially cuts linker DNA, 
leaving nucleosomal DNA largely intact. This ideally gives rise to 
clusters of reads on both sides of a positioned nucleosome, with 
the mean 5'-end positions of reads in the forward- and reverse- 
strand clusters separated by ~ 146 bp. NSeq uses a novel statistical 
test to identify positioned nucleosomes from these reads. 

The organization of the rest of this paper is as follows. The 
next two sections highlight the distinctive features of NSeq and 
compare it with competing nucleosome sequencing offerings. 
The Usage section serves as a short user's guide to our software. 
The "Methods" section provides a detailed description of NSeq's 
algorithm. 



2. DISTINCTIVE FEATURES 

Publicly available software that also finds nucleosomes from 
sequencing data includes NOrMAL (Polishko et al, 2012), 
Template Filter (Weiner et al, 2010), PING (Zhang et al, 2012), 
nucleR (Flores and Orozco, 2011), and NPS (Zhang et al, 2008a). 
NSeq has several advantages over these alternatives: 

• NSeq automatically controls for false positive positioned nucle- 
osome calls and computes a false discovery rate (FDR). A 
candidate nucleosome is excluded if its FDR is above a user- 
specified cutoff. NOrMAL, Template Filter, and PING asso- 
ciate measures of confidence with nucleosomes, but their 
connection to FDR, if any, must be inferred by the user. 

• NSeq has both user-friendly graphical and command-line 
interfaces. NOrMAL, NPS, and Template Filter are solely 
command-line utilities, while nucleR and PING run in R. 

• NSeq accepts alignment data in BAM, SAM, and BED file 
formats. Template Filter and NOrMAL use non-standard file 
formats. NPS opens only BED files. PING and nucleR sup- 
port input data in BAM, SAM, and BED formats through 
shortread, an R/Bioconductor package. 

• Unlike other software discussed here, NSeq has an integrated 
plotting tool that displays nucleosome coverage and center 
positions as well as raw read positions. NSeq also outputs a 
WIG file with nucleosome center positions. 

• NSeq is multithreaded: it exploits the availability of multicore 
processors, parallelizing its nucleosome search and FDR com- 
putations to improve performance. PING also supports parallel 
processing of input data (through the R package snowfall), 
but others do not. 

• NSeq is fast. We ran NSeq and NOrMAL on chromosome 10 
of human data (Schones et al, 2008) for default values of all 
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parameters on an Intel i5 2.7 GHz CPU. NSeq took 133 s to 
process the chromosome, while NOrMAL took 9.98 days. 

3. SOFTWARE COMPARISON 

Figure 1 compares the features of the aforementioned software 
packages. Many of these features pertain to usability. NOrMAL 
was released in June 2012, and it is currently the latest pub- 
licly available nucleosome sequencing analysis software at the 
time of this manuscript's preparation. In its accompanying paper 
(Polishko et al., 2012), Template Filter is described as the current 
state-of-the-art in nucleosome detection software. The remain- 
der of this section compares the accuracy of NSeq, NOrMAL, and 
Template Filter in finding positioned nucleosomes. 

Template Filter comes with seven characteristic distributions 
("templates") of reads flanking a nucleosome. Each template is 
designed to match a pattern of reads on a single strand that 
ostensibly indicates the presence of a nucleosome. So there are 
7 x 7 = 49 combinations of forward- and reverse-strand tem- 
plates. The software advances a sliding window across a genome 
to analyze read count data, computing cross-correlations with 
the 49 combinations of forward- and reverse-strand templates for 
various spacings between template pairs. This yields a correlation 
heat map for each template pair, with template pair spacing and 
window position on the axes. Local maxima are associated with 
candidate nucleosomes. A greedy algorithm then selects the best 
assignment of nucleosomes. 

The seven templates that come with Template Filter were 
obtained by applying the procedure outlined in the previous para- 
graph to a Saccharomyces cerevisiae dataset, but using a single 
Gaussian-shaped template. That is, a single cross-correlation was 
calculated for a given window using Gaussian-shaped ansatzes to 
characterize the forward- and reverse-strand read accumulations. 



The read patterns of these nucleosomes were then clustered using 
fc-means clustering. Each of the seven templates was chosen from 
a different cluster of read patterns. 

NOrMAL uses a mixture model of k Gaussians per chro- 
mosome to probabilistically model nucleosome occupancy and 
applies an expectation-maximization (EM) algorithm to learn the 
parameters from read count data. Each Gaussian corresponds 
to a candidate nucleosome. NOrMAL's output associates confi- 
dence and fuzziness scores with each nucleosome. The fuzziness 
scores are parameters from the mixture model. The lower the 
fuzziness score, the better-positioned a nucleosome; the lower the 
confidence score, the more likely a nucleosome is a false discovery. 

The number k of Gaussian clusters in NOrMAL's mixture 
model is found by following these steps: 

1. k is set equal to the size of a chromosome divided by the 
expected size of a nucleosome, an underestimated parameter 
specified by the user. 

2. The EM algorithm mentioned above is run until it converges. 

3. Distances between Gaussian clusters are checked, and clusters 
are merged if they overlap above a threshold input by the user. 

Steps 2 and 3 are repeated until clusters are no longer merged. 

We ran Template Filter, NOrMAL, and NSeq on nucleosome 
sequencing data for S. cerevisiae (Tsankov et al., 2010). Default 
values of all parameters were used. NSeq found 28,896 positioned 
nucleosomes in the data. NOrMAL found 49,218 nucleosomes, 
and the distribution of their fuzziness scores peaked at 15; we 
thus considered as positioned those nucleosomes whose fuzziness 
scores were less than 15. We then simulated delocalized nucle- 
osomes as follows: for each read, a random integer was drawn 
from the uniform distribution on {— 73, . . . , 73} and added to its 
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FIGURE 1 | Chart comparing features of NSeq and other publicly available nucleosome sequencing software. 
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position. Since a sharply positioned nucleosome is associated with 
tightly clustered reads, shaking the original data in this random 
manner removes the characteristic signatures of localized nucle- 
osomes; a good algorithm for detecting positioned nucleosomes 
should not call many candidates in such simulated data. We stress 
that our simulations do not merely add noise to the original data; 
they effectively reconstruct the data so the signal-to-noise ratio is 
nearly zero. While a good nucleosome detector should be robust 
to noise, it should not often mistake noise for signal. 

Denote as C(f, s) the criteria that fuzziness is less than f 
and confidence score is greater than s. FDRs were computed for 
NOrMAL as the average ratio of the total number of nucleo- 
somes satisfying C in simulated data to that in the original data. 
The FDRs corresponding to C(15, s) and C(30, s) are displayed 
in Figure 2. Note that the FDR decreases to close to zero for 
fuzziness scores <15 as the confidence score increases, indicating 
that low fuzziness score and high confidence score are generically 
associated with true positive positioned nucleosomes. An FDR 
was computed for NSeq just as for NOrMAL, but without the 
criteria C, giving 1.31 x 10~ 3 %. 

Template Filter found 64,990 nucleosomes in the original 
S. cerevisiae data, and the distribution of their correlation coef- 
ficients peaked at 0.9. Denote as D(F, R) the criteria that a 
nucleosome has correlation coefficient >0.9, and is associated 
with template F on the forward-strand and template R on the 
reverse-strand, with F, R e {1, . . . , 7}. We computed FDRs for 
Template Filter as for NOrMAL but with the criteria D rather 
than C. The median FDR across the 49 forward-reverse template 
combinations was 36.5%, with a median absolute deviation of 
8.50%. The minimum FDR of 25.3% occurred for D(4, 4). 

Template Filter thus lacks the efficacy of both NSeq 
and NOrMAL for reliably identifying positioned nucleo- 
somes, and NOrMAL requires the user to manually estimate 
the FDR. 

4. USAGE 

NSeq is distributed as a Java Archive (NSeq.jar) and can be run 
on any machine equipped with Java Runtime Environment 6 or 
later. Netbeans IDE 7.0.1 was used to design the graphical user 
interface (GUI), which consists of standard Swing components. 
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FIGURE 2 | FDRs computed from NOrMALOs results using simulations 
described in text at fuzziness score thresholds 30 (blue) and 15 (red). 



To start the GUI and set a maximum Java Virtual Machine heap 
size of 2 GB, enter 

java -jar -Xmx2g NSeq.jar 

at any Windows or Unix-like command prompt. For processing 
large genomes like human and mouse, NSeq should be run with 
a maximum heap size of at least 10 GB: 

java -jar -XmxlOg NSeq.jar. 

4.1. STARTUP SCREENS 

The opening screen (Figure 3A) explains the purpose of the 
application and the inputs required of the user. NSeq analyzes 
alignment data in BAM, SAM, or BED format. It assumes that 
the data are single-end. To facilitate fast reading of aligned data, 
a tab-separated value (TSV) file with chromosome lengths is 
required for genome assemblies other than celO, mm9, mmlO, 
hgl8, and hgl9. This is a text file with reference sequence names 
(e.g., chrl) in its first column and chromosome lengths (e.g., 
249,250,621) in its second column; it is used to preallocate mem- 
ory for storing read data. Click "Get started" to proceed to 
the load screen (Figure 3B). Here, the user specifies the loca- 
tions of the chromosome-length and alignment files, as well as 
parameters used by NSeq in its nucleosome search. The num- 
ber T of threads controls the extent to which the computation 
is parallelized. When T = 1, the computation is not parallel. We 
recommend using a value of T at least as large as the num- 
ber of available CPU cores; we found that computations are 
fastest when T is about twice the number of available cores. 
The default FDR cutoff F is 0.01. If the FDR computed for 
a given candidate nucleosome is above F, the nucleosome is 
excluded from all results. Other parameters that can be tog- 
gled on the load screen are explained in the gray help box as 
well as in "Methods." Most users will find the default settings 
adequate. 

4.2. ANALYZING ALIGNMENT DATA 

NSeq starts searching for nucleosomes using the alignment data 
after "Run" is clicked on the load screen. Status updates are dis- 
played in a text box (Figure 3C). Each chromosome is divided 
into overlapping intervals; chromosomes are analyzed interval by 
interval, and different intervals are assigned to different threads. 
When NSeq has finished its analysis, two files are written in the 
same directory as the alignment data. One is a text (TXT) file with 
genomic coordinates and FDR estimates of nucleosome centers, 
which correspond to the centers of scan windows with non-zero 
N-statistics. (Our algorithm is described in detail in "Methods") 
The second file written is a WIG file that can be opened in the 
Integrated Genome Browser (Nicol et al., 2009). It contains nucle- 
osome centers and their associated triangle statistics. The parts 
of the filenames that precede their extensions have the format 
[alignment filename] _ [ datestamp ] _ [ timestamp ] . 

4.3. DISPLAYING RESULTS 

Clicking "Display" on the run screen brings up a histogram of 
raw read positions and an overlay of nucleosome center positions 
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a WIG file, which can be opened in the Integrative Genomics Viewer 
(http://www.broadinstitute.org/igv/). Information about methods is available at 
http://songlab.ucsf.edu/ . You'll need: 

1 . A BAM, SAM. or BED file with sequencing data. 

2. (For genome assemblies other than ce10, mm9, mm10, hg18, and hg19) A TSV file 
with coordinate lengths of chromosomes: its first column should contain reference 
sequence names (e.g., chrl ). and its second column should contain their 
corresponding lengths (e.g., 249250621). 
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Chromosome length file: 

/Users/anellore/NSeq3/Example/GSM552910_ScerevisiaeJ J en 
gth.txt 

Total genome length = 12156678 
Sequence File: 

/Users/anellore/NSeq3/Example/GSM552910_Scer_CM_Jul2108 
chrl .bed 



Expert options are in pink . 

They are explained in more detail at http://songlab.ucsf.edu. 

-The chromosome length Tile is a TSV file with reference sequence names (e.g., chrl) in its first column and 
genomic -coordinate lengths (e.g.. 249250621) in its second column. Chromosome lengths for some 
genome assemblies are included. 



■Using more threads requires more memory but improves performance ■■ up to a point. A good rule of thumb 
is to use twice the number of threads as there are processor cores. 



•Increasing the number of simulations improves the accuracy of FDR computations at the expense of 
performance. 

•The scan- window width is the size of the interval ol a chromosome in genomic coordinates spanned at once 
during statistical computations. It is essentially the field of view of the triangle statistic used to distinguish 
nucleosomes from biker DNA. The default width. 200 bp. includes the length of a ful mononuctoosomo 
(-146 bp) along with padding to accommodate minimal nucleosome movement. 

•The cenlor width is the genomic -coordinate width ol the central bin ol the triangle statistic. In theory, 
decreasing the center width should eliminate more oetocafeed nucleosomes from the results. In practice, 
center widths that are too small make the algorithm too selective. 

-The triangle- statistic cutoff ts the value of the triangle statistic below which the N-statistc is atways zero 
for a given scan window. 




FIGURE 3 I Continued 
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FIGURE 3 | (A) NSeq's welcome screen. (B) NSeq's load screen. (C) NSeq's run screen. (D) NSeq's display screen. 
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on a plot of nucleosome coverage (Figure 3D). A genome can 
be navigated by changing the chromosome number as well as 
the start and end coordinates of the plots. In the read-position 
histogram, positive-strand read counts are in blue, and negative- 
strand read counts are in red. The nucleosome coverage plot is 
obtained as described in Zhang et al. (2008b): NSeq extends a 
positive-strand read by 75 bases to the right, and a negative- 
strand read by 75 bases to the left. Here, "extend" means that 
each of the counts at included coordinates is incremented by 1 . 
The extended counts are then shifted by 37 bases to the right for 
positive-strand reads and 37 bases to the left for negative-strand 
reads. This gives rise to accumulations of counts near nucleosome 
centers, which are denoted by red lines in the bottom panel in 
Figure 3D. 

4.4. COMMAND-LINE INTERFACE 

NSeq can also be used at the command-line just to obtain the out- 
put TXT and WIG files. This option maybe preferred for batching 
jobs on a cluster. For information on the command-line interface, 
enter 

java -jar NSeq. jar -h 
at a prompt. 
5. METHODS 

We discuss how NSeq finds nucleosome centers in this section. 
The steps are summarized in Figure 4. Parameters from the load 
screen (Figure 3B) explained here are the scan-window width W, 
the center width B w , the critical triangle-statistic cutoff t c , the 
FDR cutoff F, and the number of simulations S. 
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FIGURE 4 | Steps taken by NSeq during nucleosome detection. NSeq 
runs through the steps summarized here to extract nucleosome centers 
from raw sequencing data. The plots in black representatively depict 
quantities computed across the genome. 



5.1. CONVERTING READS INTO A NUCLEOSOME CENTER 
PROBABILITY LANDSCAPE 

The alignment data are a snapshot of nucleosome locations in 
random samples from a cell population. But no nucleosome is 
exactly static with respect to DNA, and even well-localized nucle- 
osomes experience small shifts. In addition, a given read should 
correspond to one end of a nucleosome in one cell, but there is 
inherently some uncertainty in where linker DNA is cleaved by 
MNase. [Indeed, MNase cleavage sites are known to have a bias 
toward AT-rich regions (Horz and Altenburger, 1981)]. Consider 
a histogram of read counts whose bins are genomic-coordinate 
positions. A positive-strand alignment starts at the 5' end of 
the reference sequence and extends to the right; a negative- 
strand alignment starts at the 3' end of the reference sequence 
and extends to the left. So there should be minimally spread 
accumulations of reads (more precisely, alignment start-position 
counts) on either side of a positioned nucleosome. 

Our algorithm first converts each read location into a prob- 
ability distribution of the corresponding nucleosome center and 
attempts to capture the uncertainty in where the linker DNA is 
cut (Figure 5). Suppose y, is the 5'-end position of the rth read; 
then, allowing for 5 bp of ambiguity in either direction, we model 
the center of the corresponding nucleosome as a random variable 
Zj = y, + X, where X e {68, . . . , 78} and P(X = x) is obtained 
from discretizing the beta distribution: 



P(X = x) 



f* + \t - 68) a - 1 (79- tf- l dt 



B(a, f3)(ll) a 



+ P-i 



(1) 



Our choice of the beta distribution is strategic: we require 
an analytic distribution with a finite domain, flexible enough to 
describe the empirical distribution of nucleosomal DNA lengths, 
yet also fast to sample from. The parameter values 68 and 78 
were selected to accommodate the 5 -base-pair ambiguity from 
the expected nucleosome center located at 146/2. For a negative- 
strand read at yi, we model its center as Zj = y s —X. NSeq 
estimates a and (3 by using previously published paired-end nucle- 
osome sequencing data for S. cerevisiae (Henikoff et al., 2011). In 
these data, a given read pair should flank a full nucleosome. The 
genomic-coordinate distance between the reads in each pair was 
halved to obtain an empirical distribution of distances between 



+ strand read 




- strand read 


FIGURE 5 | Mapping reads to a relative probability distribution of 
nucleosome centers. A positive-strand alignment is mapped to a 
discretized beta distribution whose leftmost bin is 68 bins to the right of the 
read's start position (blue). A negative-strand alignment is mapped to a 
discretized beta distribution whose rightmost bin is 68 bins to the left of the 
read's start position (red). 
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reads and their corresponding probable nucleosome centers. We 
then obtained the maximum-likelihood estimates (MLEs) of a = 
1.9204 and f$ = 1.8937, by numerically solving the equations 



1 

N 



68 



and 



11 



iKP)-i|r(a + P), 



where i|r is a digamma function, and N is the total number 
of paired-end reads with their center location f, in the range 
{68, . . . , 78}; we used the Scipy optimize. fsolve module with 
default parameters. This approach thus models the fuzziness in 
nucleosome-center position for every read in the alignment data. 
The center probability densities are then summed at each genomic 
location k and provide a score Sjt defined as 



Sk 



£>(Z, = fc), 



(2) 



where _R is the total number of reads, resulting in a relative proba- 
bility landscape for the presence of nucleosome centers across the 
genome. 

5.2. TRIANGLE STATISTIC ON THE NUCLEOSOME CENTER 
PROBABILITY LANDSCAPE 

NSeq identifies nucleosomes by advancing a scan-window that 
spans W bins across the aforementioned nucleosome-center rel- 
ative probability landscape bin-by-bin, where each bin spans one 
basepair. Thus, two successive scan windows overlap by W — 1 
bins. A positioned nucleosome in the landscape should appear 
as clustered probability masses. We assess the statistical signifi- 
cance of such clustering by using what we call the triangle statistic, 
which is motivated by the scan statistic. Suppose we partition 
the scan-window of length W = 200 into disjoint sub-windows 
of lengths 75, 50, and 75 (Figure 6). The triangle statistic to be 



defined below detects significant accumulation of nucleosome 
center probability mass in the central 50 bp sub-window, which 
allows for roughly two superhelical turn ambiguity in either direc- 
tion. Although W = 200 by default, NSeq allows the user to 
change its value. 

Denote as A, B, and C the sums of probability masses in the 
first, second, and third sub-windows, respectively; and, let A w , 
B w , and C w be the corresponding lengths of sub-windows. By 
default,A w = 75, B w = 50, andC w = 75. Now, consider the odds 
B/A and B/C. For a uniform distribution of counts across the 
scan-window, B/A and B/C should both approach 2B W /(W — 
B w ) = 50/75. However, the presence of a positioned nucleosome 
preferentially puts probability masses in the central sub-window, 
so that both B/A and B/C are large compared to the null value of 
2B w /( W — B w ). The significance of this distortion is measured by 
our triangle statistic 



min (B/A, B/C) 
2B W /{W - B w ) ' 



(3) 



We call t the triangle statistic, because it is typically much 
larger than 1 for peaks in the probability landscape that look, 
roughly, like triangles. Importantly, by construction, the triangle 
statistic is small when the scan-window is centered around linker 
DNA or delocalized nucleosomes. 

5.3. IMPROVING THE ESTIMATES OF TRIANGLE STATISTIC 

B/A and B/C are MLEs of the odds. The MLE estimates have 
high variance when A, B, or C is small; NSeq thus uses median- 
unbiased estimates (MUE) which are known to be more robust 
and accurate for small sample data (Hirji et al., 1989; Parzen et al, 
2002). NSeq uses the MUE of odds ratios in the triangle statistic 
calculations. 

Estimating the odds can be mapped to the following problem: 
given a Bernoulli random variable with success probability p, esti- 
mate r = p/(l — p) from m successes out of total M i.i.d. trials. 
In our problem, both m and M are non-negative real numbers, 
but the formalism described below has a natural generalization to 
the continuous case. The MLE of r is m/{M — m). The MUE of 
r instead uses a MUE p of the success probability p to form the 
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FIGURE 6 1 How the triangle statistic works. Red dots denote the 
centers of discretized beta distributions corresponding to reads, and 
rounded rectangles denote nucleosomes. The triangle statistic divides 
a scan-window into regions A, B, and C of sizes (in bp) 75, 50, 



and 75, respectively. The greater the number of dots in B (more 
precisely, the sum of the probability masses at positions spanned by 
B) compared to the number of dots in both A and C, the greater 
the triangle statistic. 
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oddsp/(l — p). The MUEp satisfies the probability conditions 

Pr(p > p) > - and Pr(p < p) > - . 

Note that the number of successes after M Bernoulli trials is a 
sufficient statistic m 5 and is drawn from the binomial distribution. 
An alternative formulation of the MUE is obtained by consider- 
ing the distribution of the sufficient statistic m 5 . For the observed 
value m s = m, the MUEp occurs where (Hirji et al, 1989) 



Pr(m s < m\p = p) > - and Pr(m s > m \ p = p) > — . (4) 



need to modify the triangle statistic so that only one of the corre- 
lated windows would yield a significant statistic for a positioned 
nucleosome. A similar problem arises in the analysis of ChlP- 
chip or ChlP-seq data, and a Poisson-clumping approach was 
previously used to remove the positive correlation (Zhang, 2008). 
Inspired by that method, we define a new statistic N in terms of 
the triangle statistics. Let f; be the triangle statistic for the scan- 
window whose leftmost bin position is Define the new statistic 
Nj for the rth scan-window as 

n '=[ n i(t i < $ ) [ n /(f < - # ) i{t ' - *c) 

\j = >-25 I V='+ 1 / 



In general, some range of p satisfies the conditions, and the 
midpoint between its boundary values pi and p2 is taken as the 
MUE. Each of the boundary values pi and p2 occurs where one of 
the inequalities (Equation 4) is saturated: 



A/! 



Pr(m s < m\pi) = 

m s — ( 
M 

Pr(m, > m I pi) = 

F (M-m s Y.m s \ 



(M — m s )\m s \ 
M\ 



Pi (1 - pi) 



(5) 



The normalized incomplete beta function I x (a, f5) is defined as 

/ 0 *« a_1 (l - uf- l du 



I x (ot, P) 



iP-^l - u)$- l du 



Equation (5) can be rewritten in terms of I as 

1 

h (tn + l.M — m) = -; 

pi 2 



k 2 (m, M — m + 1) 



1 



These relations are numerically solvable for pi andp2 even for 
non-integer values of M and m, and the MUE p is determined 
from 



this formalism 



- Pi + P2 

To compute the triangle statistic, NSeq uses mis to 
with m = B and M = A + B, or m = B and M = C + B. 

5.4. REMOVING CORRELATIONS AMONG ADJACENT TRIANGLE 
STATISTICS 

Triangle statistics corresponding to scan windows that have sub- 
stantial overlap are correlated. For the S. cerevisiae data consid- 
ered in Results and other nucleosome sequencing data, we found 
that the autocorrelation length is 20-30 bins when W = 200. 
Several successive scan windows covering a single localized nucle- 
osome will thus all return large values of the triangle statistic. This 
correlation is a problem for FDR estimation procedures, which 
often assume independent samples of random variables. We thus 



x I 52 t,+68 +i - 10 7 \J2 t'-^+i - 10 ' (6) 



J =1 



v =1 



where I denotes an indicator function, and t c is a critical cutoff. 
N,- is either 1 or 0, and the Nj for successive windows are anti- 
correlated. Moreover, at most only one Nj € {Ni-25, . . . , Ni+25} 
is non-zero; the JV-statistic picks out clumps of scan windows 
with large triangle statistics. The last two indicator functions set 
Ni = 0 when there are similar clumps in the neighborhood of 
the ith window; they provide extra insurance against the pos- 
sibility of detecting overlapping or delocalized nucleosomes. In 
NSeq, the critical cutoff t c is by default 1.7, which we found to 
be sufficiently low to detect all nucleosomes below FDR 0.01. 
NSeq nominates all scan windows for which N, = 1 as candidate 
positioned nucleosomes, and then filters out candidates which 
are above the specified FDR cutoff F, as described below. Both 
P and t c can be toggled in the load window. 

5.5. COMPUTING FALSE DISCOVERY RATES 

An FDR associated with a candidate nucleosome is found by 
performing the following steps: 

FDR Estimation 

Let S = number of simulations; 

Let -R = total number of reads in the 

sequencing data; 

Let tj = triangle statistic associated with 
the jth candidate nucleosome. 

Let Mj = number of candidate nucleosomes with 
triangle statistic > tj. 
For k = 1, ... ,S: 
For i = 1, . . . ,R: 

Sample X ~ uniform distribution on 

{-73, . . . , 73}. 

Shift rth read bylto simulate delocalized 

nucleosomes . 
Run NSeq on the simulated data. 
Set mj-j = number of nucleosomes with 
triangle statistic > tj in the simulated 
data . 

Set FDR(Zj) = Em m k j/M r 
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